Abstract



We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI), showing a precise relationship between its performance and the uniformity of the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando's (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.

1 Introduction

Background The rapid increase in the availability of electronic documents has created high demand for automated text analysis technologies such as document clustering, summarization, and indexing. Representations enabling accurate measurement of semantic similarities between documents would greatly facilitate such technologies. In this paper, we focus on representations in which vector directionality is used to represent a document's semantics, by which we mean its (human-interpretable) constituent concepts. This goal is to be accomplished without access to concept labels, since they are typically not available in many applications.

The vector space model (VSM) is a classic method for constructing such vector-based representations. It encodes a document collection by a term-document matrix whose [i,j]th element indicates the association between the ith term and jth document. However, as has been pointed out previously, VSM does not always represent semantic relatedness well; for instance, documents that do not share terms are mapped to orthogonal vectors even if they are clearly related.

Latent Semantic Indexing (LSI) (Deerwester et al., 1990; Dumais, 1991) attempts to overcome this and other shortcomings by computing an approximation to the original term-document matrix; this is equivalent to projecting the term-document matrix onto a lower-dimensional subspace. LSI has been successfully applied to information retrieval and many language analysis tasks (e.g. Dumais and Nielsen (1992), Foltz and Dumais (1992), Berry, Dumais, and O'Brien (1995), Foltz, Kintsch, and Landauer (1998), and Wolfe et al. (1998)), prompting several studies to explain its effectiveness (e.g. Bartell, Cottrell, and Belew (1992), Story (1996), Ding (1999), Papadimitriou et al. (2000), and Azar et al. (2001)).

Ando (2000) introduced an alternative subspace-projection
method, which we call Iterative Residual
Rescaling (IRR), that outperforms LSI by counteract- 
ing its tendency to ignore minority-class documents. 
This is done by repeatedly rescaling vectors to am- 
plify the presence of documents poorly represented in 
previous iterations. However, Ando presented only 
heuristic arguments to explain IRR's success. 

Contributions In this paper, we use the notion of 
subspace projection to formalize the document rep- 
resentation problem. Based on this framework, we 
provide a new theoretical analysis that shows a pre- 
cise relationship between the performance of LSI and 
the uniformity of the underlying distribution of doc- 
uments over topics. As a consequence, we provide an 
explanation for IRR's success: the rescaling it per- 
forms compensates for non-uniformities in the topic- 
document distribution. Moreover, our framework 
yields a new way to automatically adjust the amount 
of rescaling by estimating the non-uniformity. 

To support our theoretical results, we present per- 
formance measurements both on document sets in 
which the topic-document distributions were care- 
fully controlled, and on unrestricted datasets as 
would be found in application settings. In all cases 
and for all metrics, the results confirm our theoret- 
ical predictions. For instance, IRR combined with 
our new parameter selection technique achieved up to 
10.1% higher kappa average precision than LSI and 
up to 8.7% better document clustering performance. 



The experiments as a whole provide strong evidence 
for the usefulness of our framework in general and 
the effectiveness of our augmented IRR in particular. 

Notational conventions A bold uppercase letter (e.g. M) denotes a matrix; the corresponding bold lowercase letter with subscript i (e.g. $m_i$) denotes the matrix's ith column vector. We use range(M) to denote M's range, or column space: $\{y \mid \exists x \text{ such that } y = Mx\}$. When a document collection has been specified, n denotes the number of documents the collection contains.

2 Analyzing LSI 

2.1 Topic-based similarities 

Our framework for analyzing Latent Semantic Indexing and other subspace-based methods revolves around the notion of topic-based similarity. Fix an n-document collection C and corresponding m-by-n term-document matrix D. We assume that there exists a set, denoted topics(C), of k < n topics underlying C. We also assume that for each topic t and document d there exists a (real-valued) relevance score rel(t, d), suitably normalized so that for each d, we have $\sum_{t \in \mathrm{topics}(C)} \mathrm{rel}(t, d) = 1$. We then define the true topic-based similarity between two documents d and d' as:

$$\mathrm{sim}(d, d') = \sum_{t \in \mathrm{topics}(C)} \mathrm{rel}(t, d)\,\mathrm{rel}(t, d').$$

It is convenient to summarize these similarities in a single n-by-n matrix S, where S[d, d'] = sim(d, d').
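The definition of S can be made concrete with a small numerical sketch. The relevance scores below are hypothetical values chosen for illustration, and the function name `topic_similarity` is not from the paper; the only assumption carried over from the text is that each document's scores sum to 1.

```python
import numpy as np

def topic_similarity(rel):
    """Return the n-by-n matrix S with S[d, d'] = sum_t rel(t, d) * rel(t, d').

    rel is a k-by-n array (topics by documents); each column must sum to 1.
    """
    rel = np.asarray(rel, dtype=float)
    if not np.allclose(rel.sum(axis=0), 1.0):
        raise ValueError("each document's relevance scores must sum to 1")
    return rel.T @ rel

# Hypothetical toy collection: 2 topics, 3 documents.
# Document 1 is split evenly between the two topics.
rel = np.array([[1.0, 0.5, 0.0],
                [0.0, 0.5, 1.0]])
S = topic_similarity(rel)
```

In this toy collection the two single-topic documents have similarity 0 to each other and 0.5 to the mixed document.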

Note that although we assume the existence of underlying topics as the basis for the true document similarities, in contrast to other analyses (e.g. Papadimitriou et al. (2000), Azar et al. (2001), and Ding (1999)) we do not assume that there is an underlying generative or probabilistic model that creates the term-document matrix D.

2.2 The optimum subspace

We formulate the ultimate goal of subspace-based algorithms, such as LSI, as choosing some subspace such that projecting D onto this subspace creates new document vectors whose measured similarities (i.e., cosines) more closely correspond to the true topic-based similarities.

More formally, for any subspace $\mathcal{X} \subseteq \mathbb{R}^m$, the unique (orthogonal) projection of a vector $x \in \mathbb{R}^m$ onto $\mathcal{X}$ is given by $P_{\mathcal{X}}(x) = BB^T x$ for any B whose columns form an orthonormal basis for $\mathcal{X}$. We define the projection of D onto $\mathcal{X}$ as the matrix $P_{\mathcal{X}}(D) = [\,P_{\mathcal{X}}(d_1) \cdots P_{\mathcal{X}}(d_n)\,]$, i.e., the result of projecting each of the term-document vectors. Hence, after projection onto $\mathcal{X}$, the similarity between the ith and jth documents is measured by

$$\cos(P_{\mathcal{X}}(d_i), P_{\mathcal{X}}(d_j)) = \frac{P_{\mathcal{X}}(d_i)^T P_{\mathcal{X}}(d_j)}{|P_{\mathcal{X}}(d_i)|\,|P_{\mathcal{X}}(d_j)|}.$$

The document representation problem is as follows: given D (but not S, or even any knowledge of what the underlying topics are), find a subspace $\mathcal{X}$ such that the entries of the deviation matrix^1

$$\mathrm{diff}_{S,D}(\mathcal{X}) = S - P_{\mathcal{X}}(D)^T P_{\mathcal{X}}(D)$$

are small. The optimum subspace

$$\mathcal{X}_{\mathrm{opt}} = \operatorname*{argmin}_{\mathcal{X} \subseteq \mathrm{range}(D)} \|\mathrm{diff}_{S,D}(\mathcal{X})\|_2,$$

with ties broken by smallest dimensionality and then arbitrarily, serves as the standard for comparison in our analysis. We denote the corresponding projection operator by $P_{\mathrm{opt}}$, and use $\epsilon_{\mathrm{opt}}$ to denote the optimum error $\|\mathrm{diff}_{S,D}(\mathcal{X}_{\mathrm{opt}})\|_2$. Note that $\epsilon_{\mathrm{opt}}$ need not be zero, as it may be impossible to project the given term-document matrix in such a way as to perfectly recover the true document similarities.
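As an illustration of projection and the deviation matrix, the following minimal sketch uses a hypothetical three-term, two-document collection. The data, the chosen subspace, and the helper names (`project`, `deviation_norm`, `cosine`) are assumptions made for the example, not part of the paper.

```python
import numpy as np

def project(D, B):
    """Project every column of D onto the subspace whose orthonormal basis is the columns of B."""
    return B @ (B.T @ D)

def deviation_norm(S, D, B):
    """Spectral norm of the deviation matrix S - P(D)^T P(D)."""
    P = project(D, B)
    return np.linalg.norm(S - P.T @ P, ord=2)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical data: two documents on different topics (so S is the identity)
# whose vectors overlap only through a shared third "noise" term.
D = np.array([[0.8, 0.0],
              [0.0, 0.8],
              [0.6, 0.6]])
S = np.eye(2)

# Candidate subspace: the span of the first two term axes.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

cos_before = cosine(D[:, 0], D[:, 1])   # about 0.36: spurious similarity in the raw vectors
P = project(D, B)
cos_after = cosine(P[:, 0], P[:, 1])    # 0.0: agrees with the true similarity
err = deviation_norm(S, D, B)           # deviation measured on inner products
```

Here the projection removes the shared noise term, so the measured cosine between the two cross-topic documents drops from about 0.36 to 0.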

2.3 The singular value decomposition and 
LSI 

In this section, we first briefly introduce the singular value decomposition (SVD) (Golub and Van Loan, 1996), since singular values are necessary for our analysis. Then, we describe LSI, which is based on the SVD.

The SVD factors an arbitrary rank-h matrix $Z \in \mathbb{R}^{m \times n}$ into the following product:

$$Z = U \Sigma V^T,$$

where the columns of $U \in \mathbb{R}^{m \times h}$ and $V \in \mathbb{R}^{n \times h}$ are orthonormal, $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_h)$ is diagonal (following convention, we assume $\sigma_1 \ge \cdots \ge \sigma_h$), and the $\sigma_i$'s are all positive. The quantities $\sigma_i$, $u_i$ and $v_i$ are called the ith singular value, left singular vector, and right singular vector, respectively. The left singular vectors span Z's range, and $\sigma_1 = \|Z\|_2$.

Zeroing out all but the $\ell < h$ largest singular values yields the least-squares optimal rank-$\ell$ approximation to Z. Latent Semantic Indexing (LSI) (Deerwester et al., 1990; Dumais, 1991) applies this rank-$\ell$ approximation to the term-document matrix D, which corresponds to projecting D onto the rank-$\ell$ LSI subspace spanned by $u_1, \ldots, u_\ell$: letting $U_\ell = [\,u_1 \cdots u_\ell\,]$,

$$U_\ell U_\ell^T D = U\,\mathrm{diag}(\underbrace{\sigma_1, \ldots, \sigma_\ell, 0, \ldots, 0}_{h \text{ entries}})\,V^T.$$



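^1 It suffices to consider the inner products $P_{\mathcal{X}}(d_i)^T P_{\mathcal{X}}(d_j)$ rather than the cosines because, as shown by Ando (2001), if there exists $\epsilon < 1$ that upper-bounds the magnitudes of the deviation matrix's entries, then for any $d_i$ and $d_j$,

$$\frac{\mathrm{sim}(i, j) - \epsilon}{1 + \epsilon} \le \cos(P_{\mathcal{X}}(d_i), P_{\mathcal{X}}(d_j)) \le \frac{\mathrm{sim}(i, j) + \epsilon}{1 - \epsilon}.$$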



Note that the fact that this matrix approximates D well does not imply that it represents the true document similarities well, as we shall see.
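The equivalence between truncating the SVD and projecting onto the leading left singular vectors can be checked numerically. This is a minimal sketch on hypothetical random data; `lsi_project` is an illustrative name, not an API from the paper.

```python
import numpy as np

def lsi_project(D, ell):
    """Project the columns of D onto the span of its first `ell` left singular vectors."""
    U, sigma, Vt = np.linalg.svd(D, full_matrices=False)
    U_ell = U[:, :ell]
    return U_ell @ (U_ell.T @ D), U_ell

rng = np.random.default_rng(0)      # hypothetical random data, for illustration only
D = rng.random((6, 4))              # 6 terms, 4 documents
D_lsi, U_ell = lsi_project(D, ell=2)

# The same matrix, obtained by zeroing all but the 2 largest singular values.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)
sigma_trunc = np.where(np.arange(sigma.size) < 2, sigma, 0.0)
D_trunc = (U * sigma_trunc) @ Vt    # scales column i of U by sigma_trunc[i]
```

Both routes give the same rank-2 matrix, and the largest singular value equals the spectral norm of D.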

Further intuition may be gained on the left singular vectors by the following observation. Let $\mathrm{proj}^{(j)}(d_i)$ be the projection of $d_i$ onto the span of $u_1, \ldots, u_j$, and let $r_i^{(j)}$ be the residual vector $d_i - \mathrm{proj}^{(j-1)}(d_i)$. Then, $u_j$ is the unit vector that maximizes the following quantity:

$$\sum_{i=1}^{n} \left( |r_i^{(j)}| \cos(r_i^{(j)}, x) \right)^2.$$

In a sense, $u_j$ resembles a weighted average of residual vectors, where longer residuals receive greater weight. Hence, the $\ell$ left singular vectors may be thought of as representing the $\ell$ major directions in the document collection.

2.4 Non-uniformity and LSI 

We now state our results relating the non-uniformity of the underlying topic-document distribution to the quality of the document representation spaces derived using LSI. The main outline of our argument is to first show relations between certain singular values and certain quantities linked to our topic model, and then show how the distance between the LSI-subspace and the optimal subspace relates to these singular values. Proofs of these results, which make use of invariant subspace perturbation theorems (Davis and Kahan, 1970; Stewart and Sun, 1990; Golub and Van Loan, 1996), are sketched in the appendix of this paper and given in full in Ando (2001).

Recall that we are dealing with a fixed document collection C with k underlying topics. Throughout, we use h to denote the dimensionality of the optimum subspace. For clarity, we will abuse notation by writing "$x \in y \pm z$" as shorthand for "$x \in [y - z, y + z]$".

A crucial quantity in our analysis is the dominance $\Delta_t$ of a given topic t:

$$\Delta_t = \sqrt{\sum_{d \in C} \mathrm{rel}(t, d)^2}.$$

(It may be helpful to observe that in the special "single-topic documents" case, where each document is relevant to only one topic in topics(C), squaring $\Delta_t$ gives exactly the number of documents in C that are relevant to topic t.) We assume without loss of generality that $\Delta_1 \ge \Delta_2 \ge \cdots \ge \Delta_k$, and for convenience set $\Delta_i = 0$ if $i > k$.

Now we show that the projection of D onto the optimum subspace $\mathcal{X}_{\mathrm{opt}}$ in some sense reveals the topic dominances. It is intuitively clear, however, that the extent to which this holds should depend to some degree both on the optimum error $\epsilon_{\mathrm{opt}}$ and on the topic mingling

$$\mu(C) = \left( \sum_{t, t' \in \mathrm{topics}(C),\; t \ne t'} \Big( \sum_{d \in C} \mathrm{rel}(t, d)\,\mathrm{rel}(t', d) \Big)^2 \right)^{1/2}$$

(note that in the single-topic documents case, $\mu(C) = 0$). Certainly, if the optimum error is high, then we cannot expect the optimum subspace to fully reveal the topic dominances; also, if there is high topic mingling in the collection, then the topics will be fairly difficult to distinguish.
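The two quantities just defined are easy to compute when the relevance scores are known. The sketch below assumes a hypothetical k-by-n relevance array; the function names are illustrative only.

```python
import numpy as np

def dominances(rel):
    """Topic dominances Delta_t = sqrt(sum_d rel(t, d)^2), sorted in decreasing order."""
    rel = np.asarray(rel, dtype=float)
    return np.sort(np.sqrt((rel ** 2).sum(axis=1)))[::-1]

def topic_mingling(rel):
    """mu(C) = sqrt(sum over t != t' of (sum_d rel(t, d) * rel(t', d))^2)."""
    rel = np.asarray(rel, dtype=float)
    G = rel @ rel.T                    # k-by-k matrix of topic co-occurrence sums
    off = G - np.diag(np.diag(G))      # keep only the t != t' entries
    return float(np.sqrt((off ** 2).sum()))

# Hypothetical single-topic collection: 4 documents on topic 0, 1 document on topic 1.
rel = np.array([[1, 1, 1, 1, 0],
                [0, 0, 0, 0, 1]], dtype=float)
delta = dominances(rel)       # squared dominances are the per-topic document counts
mu = topic_mingling(rel)      # no document mixes topics, so the mingling is 0
```

For this collection the dominances are 2 and 1, matching the observation that squaring a dominance gives the number of documents on that topic in the single-topic case.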

Theorem 2.1 Let $\tau_i$ be the ith largest singular value of $P_{\mathrm{opt}}(D)$. Then, $\tau_i^2 \in \Delta_i^2 \pm (\epsilon_{\mathrm{opt}} + \mu(C))$.

This result gives us leave to define $\Delta_{\max} = \tau_1$ and $\Delta_{\min} = \tau_h$, where h is the dimension of $\mathcal{X}_{\mathrm{opt}}$. The ratio $\Delta_{\max}/\Delta_{\min}$ then serves as a measure of the non-uniformity of the topic-document distribution underlying the collection: the more the largest topic dominates the collection, the higher this ratio will tend to be.

Now we are in a position to present our main theorem. This result bounds the distance between the optimum subspace $\mathcal{X}_{\mathrm{opt}}$ and the same-dimensionality LSI subspace $\mathcal{X}_{\mathrm{LSI}}$ by a function of the non-uniformity of C's topic-document distribution. The bound also incorporates a certain value (defined precisely in the appendix) $\tilde{\epsilon}_{\mathrm{VSM}} \in \epsilon_{\mathrm{VSM}} \pm \epsilon_{\mathrm{opt}}$, where $\epsilon_{\mathrm{VSM}} = \|\mathrm{diff}_{S,D}(\mathcal{X}_{\mathrm{VSM}})\|_2 = \|S - D^T D\|_2$ is the input error; intuitively, if a "bad" term-document matrix is received as input, one cannot expect LSI to do well.

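Theorem 2.2 Let $\mathcal{X}_{\mathrm{LSI}}$ be the h-dimensional LSI subspace spanned by the first h left singular vectors of D. If $\Delta_{\min} > \sqrt{\tilde{\epsilon}_{\mathrm{VSM}}}$, then

$$\|\tan \Theta(\mathcal{X}_{\mathrm{LSI}}, \mathcal{X}_{\mathrm{opt}})\|_2 \le \frac{\dfrac{\Delta_{\max}}{\Delta_{\min}} \cdot \dfrac{\sqrt{\tilde{\epsilon}_{\mathrm{VSM}}}}{\Delta_{\min}}}{1 - \left( \sqrt{\tilde{\epsilon}_{\mathrm{VSM}}} / \Delta_{\min} \right)^2},$$

where $\Theta$ is the canonical angle matrix measuring the distance between subspaces (Davis and Kahan, 1970; Stewart and Sun, 1990).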



Intuitively, what this means is that $\mathcal{X}_{\mathrm{LSI}}$ must be close to $\mathcal{X}_{\mathrm{opt}}$ when the topic-document distribution is relatively uniform and the input error is small in comparison to the hth largest topic dominance. Conversely, if the input error is fixed, our bound on LSI's performance weakens when the underlying topic-document distribution is highly non-uniform. We also note that the condition on $\Delta_{\min}$ is natural: roughly speaking, if the input error is large enough to "swamp" the dominance of the hth largest topic, then intuitively we cannot expect good results.
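The canonical angles that appear in Theorem 2.2 can be computed with standard linear algebra: the cosines of the angles between two subspaces are the singular values of the product of their orthonormal bases. The following is a minimal sketch under that standard characterization; the function name and the two-line example are illustrative, not taken from the paper.

```python
import numpy as np

def tan_canonical_angles(A, B):
    """Tangents of the canonical (principal) angles between range(A) and range(B).

    A and B are assumed to have full column rank and the same number of columns.
    """
    Qa, _ = np.linalg.qr(A)            # orthonormal basis for range(A)
    Qb, _ = np.linalg.qr(B)            # orthonormal basis for range(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    with np.errstate(divide="ignore"):
        # Orthogonal directions (cosine 0) give an infinite tangent.
        return np.sqrt(1.0 - cosines ** 2) / cosines

# Two lines in the plane that are 45 degrees apart.
A = np.array([[1.0], [0.0]])
B = np.array([[1.0], [1.0]])
t = tan_canonical_angles(A, B)
distance = t.max()    # the 2-norm of tan(Theta) is the largest tangent
```

For identical subspaces every tangent is 0, so a small value of this distance means the two subspaces nearly coincide. Computing the sine as the square root of one minus the squared cosine loses accuracy for very small angles, which is acceptable for a sketch.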

Finally, we note that a related result can be proved which, roughly speaking, links lower bounds on the distance between the two subspaces to non-uniformity and the input error; however, this theorem is quite technical in nature and thus is omitted. We refer the reader to Ando (2001) for details.



2.5 Related work: theoretical analyses of LSI

As noted above, there have been several studies analyzing LSI, using approaches such as Bayesian regression models (Story, 1996) and Gaussian models (Ding, 1999). Here, we concentrate on describing the work most similar in spirit to ours.

Zha, Marques, and Simon (1998) propose a subspace-based model for LSI. Their work focuses on dimensionality selection and implementation issues regarding more accurate updating schemes. Bartell, Cottrell, and Belew (1992) show that LSI can be regarded as a solution to the special Multidimensional Scaling (MDS) problem of preserving the inner products of the original document vectors. However, as noted above, this is not the same as recovering hidden topic-based similarities, especially in the case of noisy data.

Perhaps the work most similar to ours is that of Papadimitriou et al. (2000) and Azar et al. (2001), both of which propose to explain LSI's success with
analyses that, like ours, employ invariant subspace 
perturbation theorems. Papadimitriou et al. start 
with a probabilistic corpus model. By assuming low 
input error and certain conditions on singular val- 
ues (which, from our perspective, can be considered 
to be roughly equivalent to assuming relative unifor- 
mity, although Papadimitriou et al. did not explicitly 
make this connection), they show that LSI will work 
well with high probability. But their results are based 
on a pure probabilistic corpus model in which all the 
documents are single-topic and topics have associated 
primary (distinguishing) disjoint sets of terms. Thus, 
their analysis holds only for a very restricted class of 
document collections. Similarly, Azar et al. also start 
with particular conditions and a specialized underly- 
ing generative model to show that LSI works well for 
"good" documents with high probability. In contrast 
to these approaches, our analysis does not assume a 
model of term-document matrix creation, and so ap- 
plies to arbitrary term-document matrices, with the 
non-uniformity and the input matrix's quality being 
explicit terms in our bound. 

3 IRR: Overcoming non-uniformity 



Our results from Section 2.4 indicate that we could improve the performance of LSI if we could somehow "smooth" the topic-document distribution (that is, effectively lower $\Delta_{\max}/\Delta_{\min}$). We show that the Iterative Residual Rescaling (IRR) algorithm, introduced (but not named) and heuristically motivated by Ando (2000), accomplishes this task without prior knowledge of the assignments of documents to topics.

3.1 Ando's IRR algorithm 



Recall from Section 2.3 that the left singular vectors $u_1, u_2, \ldots$ produced by LSI can be derived, one by one, via the following computation:

$$u_j = \operatorname*{argmax}_{x:\,|x| = 1} \sum_{i=1}^{n} \left( |r_i^{(j)}| \cos(r_i^{(j)}, x) \right)^2,$$

where the $r_i^{(j)}$ are the residuals $d_i - \mathrm{proj}^{(j-1)}(d_i)$. Unfortunately, inspection of this formula shows that when the topic-document distribution is highly non-uniform, the cumulative influence of a large number of (small) residuals for a major topic can cause smaller topics to be ignored, as depicted in Figures 1(a) and (b).

Our explanation of IRR's effectiveness is that by amplifying the length differences among residual vectors, IRR boosts the influence of minority-topic documents, thus compensating for non-uniform topic distributions (Figure 1(c)). More precisely, IRR, like (our formulation of) the SVD, computes basis vectors by successive maximizations, as shown in the pseudocode in Figure 2. Crucially though, IRR's objective function, $g^{(j)}$, incorporates a scaling factor q via the scaling function $\mathrm{pow}(r, q) = |r|^q\, r$:

$$g^{(j)}(x) = \sum_{i=1}^{n} \left( |\mathrm{pow}(r_i^{(j)}, q)| \cos(\mathrm{pow}(r_i^{(j)}, q), x) \right)^2.$$

This is maximized by the first left singular vector of

$$R^{(j)} = [\,\mathrm{pow}(r_1^{(j)}, q) \;\cdots\; \mathrm{pow}(r_n^{(j)}, q)\,].$$

That is, IRR rescales each residual vector $r_i^{(j)}$ at each basis vector computation, increasing the contrast between long and short residuals when $q > 0$. LSI is the special case in which $q = 0$.
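The effect of the scaling function on residual lengths can be seen in a few lines. The two residual vectors below are hypothetical values chosen so that one is three times longer than the other.

```python
import numpy as np

def pow_scale(r, q):
    """pow(r, q) = |r|^q * r: keeps the direction of r and raises its length to the power q + 1."""
    return np.linalg.norm(r) ** q * r

long_r = np.array([0.9, 0.0])     # hypothetical residual of a poorly represented document
short_r = np.array([0.0, 0.3])    # hypothetical residual of a well represented document

# Ratio of rescaled lengths for several scaling factors.
ratios = {
    q: np.linalg.norm(pow_scale(long_r, q)) / np.linalg.norm(pow_scale(short_r, q))
    for q in (0, 2, 4)
}
```

With q = 0 the length ratio stays at 3 (the LSI case); with q = 2 and q = 4 it grows to 27 and 243, so the longer residual has far more influence on the next basis vector.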

3.2 The AUTO-SCALE method 

Our discussion above argues that the degree of rescaling should depend on the uniformity of the topic-document distribution. Ando (2000) did not explicitly make this connection, and hence could not take advantage of it: q was determined simply by training on held-out data. In contrast, our novel analysis allows us to exploit this connection to develop an effective estimation method, automatic scaling factor determination (AUTO-SCALE), that approximates the topic-document non-uniformity without prior knowledge of the underlying topics.

AUTO-SCALE is based on the observation that we can use the quantity $\sum_{t \in \mathrm{topics}(C)} \Delta_t^4$ as a measure of the non-uniformity of the topic-document distribution. Of course, we don't have access to the topic dominances $\Delta_t$, but we can approximate this measure by a quantity f(D) based on $\|D^T D\|_F^2$.^2

^2 The Frobenius norm $\|M\|_F$ of a matrix M is defined as $\sqrt{\sum_{i,j} M[i,j]^2}$.



Figure 1: Effect of nonuniformity on LSI, and how IRR compensates. (a) LSI's first basis vector $u_1$ points in the dominant direction. (b) After computing residuals, the next LSI basis vector is still biased towards dominant-topic vectors, despite being orthogonal to $u_1$. (c) Rescaling the residuals boosts the influence of minority-topic documents.



IRR(q, ℓ):
    R := D    /* initialize residuals by given m-by-n term-document matrix */
    For j := 1, 2, ..., ℓ    /* create ℓ basis vectors */
        For i := 1, 2, ..., n
            r~_i := |r_i|^q r_i    /* rescale residuals */
        b_j := argmax over x with |x| = 1 of Σ_{i=1..n} (|r~_i| cos(r~_i, x))^2
        For i := 1, 2, ..., n
            Subtract from r_i its projection onto b_j    /* recompute residuals */
    D_IRR := B B^T D, where B = [b_1 ... b_ℓ]    /* new document representation */

Figure 2: High-level pseudocode for Ando's IRR (i.e. before augmentation with AUTO-SCALE).
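The pseudocode of Figure 2 translates fairly directly into NumPy. The following is a sketch rather than the authors' implementation: it assumes the maximization step is carried out with a full SVD of the rescaled residual matrix, which the text states yields the maximizer, and the function name `irr` is illustrative.

```python
import numpy as np

def irr(D, q, ell):
    """Sketch of Iterative Residual Rescaling.

    D   : m-by-n term-document matrix (columns are documents)
    q   : scaling factor (q = 0 reduces to LSI)
    ell : number of basis vectors to create
    Returns (D_irr, B): the projected documents and the m-by-ell basis.
    """
    D = np.asarray(D, dtype=float)
    R = D.copy()                                  # residuals, initialized to D
    basis = []
    for _ in range(ell):
        lengths = np.linalg.norm(R, axis=0)
        R_scaled = R * lengths ** q               # r_i -> |r_i|^q r_i, column by column
        # The unit vector maximizing sum_i (|r_i| cos(r_i, x))^2 is the first
        # left singular vector of the rescaled residual matrix.
        U, _, _ = np.linalg.svd(R_scaled, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        R = R - np.outer(b, b @ R)                # subtract each residual's projection onto b
    B = np.column_stack(basis)
    return B @ (B.T @ D), B

rng = np.random.default_rng(0)                    # hypothetical random data
D = rng.random((8, 5))
D_lsi, B_lsi = irr(D, q=0, ell=2)                 # q = 0: same subspace as rank-2 LSI
D_irr, B_irr = irr(D, q=2, ell=3)
```

Because each new basis vector lies in the span of residuals that are already orthogonal to the earlier ones, the columns of B come out orthonormal for any q, which is what makes B B^T D a projection.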



This approximation follows from first assuming that the input matrix is fairly good, so that $\|D^T D\|_F^2$ is roughly equal to

$$\|S\|_F^2 = \sum_{d, d' \in C} \Big( \sum_{t \in \mathrm{topics}(C)} \mathrm{rel}(t, d)\,\mathrm{rel}(t, d') \Big)^2.$$

The latter can be rewritten, after some algebra, as the quantity $\sum_{t \in \mathrm{topics}(C)} \Delta_t^4 + \mu(C)^2$, which is roughly $\sum_{t \in \mathrm{topics}(C)} \Delta_t^4$ if we assume approximately single-topic documents. In practice, we set q to a linear function of f(D); this is discussed in Section 5.1.

Although the above assumptions are rather coarse, AUTO-SCALE yields good empirical results: see Sections 5 and 6.
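The algebraic step relating the squared Frobenius norm of S to the topic dominances and the topic mingling can be verified numerically. The sketch below uses hypothetical random relevance scores and relies only on the definitions of S, the dominances, and the mingling given earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
rel = rng.random((3, 8))             # hypothetical relevance scores: 3 topics, 8 documents
rel /= rel.sum(axis=0)               # normalize so each document's scores sum to 1

S = rel.T @ rel                      # n-by-n true similarity matrix
G = rel @ rel.T                      # k-by-k matrix; G[t, t] is the squared dominance of t

delta4 = (np.diag(G) ** 2).sum()     # sum over topics of the fourth power of the dominance
mu2 = (G ** 2).sum() - delta4        # squared topic mingling: off-diagonal entries of G, squared

lhs = np.linalg.norm(S, "fro") ** 2
rhs = delta4 + mu2
```

The two sides agree because the n-by-n matrix S and the k-by-k matrix G have the same Frobenius norm; when documents are single-topic, the mingling term vanishes and only the dominance term remains.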

3.3 Dimensionality selection 

IRR's second parameter is $\ell$, the dimensionality of the created subspace. One way to set this parameter is to train it on held-out data. Following Ando (2000), we found that learning thresholds on the residual ratio $\|R^{(j)}\|_F^2 / n$ as a stopping criterion is effective for both LSI and IRR. Intuitively, this ratio describes how much is left out of the proposed subspace (of course, we do not want to reproduce the term-document matrix exactly, hence the threshold). Note that this training method allows some flexibility in the chosen dimensionality: for different data sets, the same residual ratio threshold may result in selecting a different $\ell$.

While training on held-out data is reasonable and is commonly employed in practice, it is a relatively expensive process. A speedier alternative arises in settings in which k, the number of topics, is pre-specified; examples include cases where the topic set is a fixed class such as the TREC topic labels, or where the application allows the user to specify the appropriate level of granularity for his or her needs. In such settings, we could simply set the dimensionality equal to k as a matter of convenience. (Indeed, Papadimitriou et al. (2000) show that under certain strong assumptions on the data, rank-k LSI should perform well. See also Ando (2001).) We describe experiments with both selection methods below.
4 Evaluation metrics 

Kappa average precision Our first evaluation metric is based on the pair-wise average precision (Ando, 2000), adapted from the average precision measure commonly used in information retrieval. The motivation behind this metric is that the measured similarity for any two intra-topic documents (i.e., that share at least one topic) should be higher than for any two cross-topic documents, which have no topics in common. More formally, let $p_i$ denote the document pair with the ith largest measured similarity (cosine). Precision for an intra-topic pair $p_j$ is defined by

$$\mathrm{prec}(p_j) = \frac{\#\text{ of intra-topic pairs } p_i \text{ such that } i \le j}{j}.$$



             topic 1   topic 2   topic 3   topic 4
cluster 1        5        10        20         0
cluster 2        5        10         5         0
cluster 3        0         0         0        21
cluster 4       15         5         0         0
cluster 5        0         0         0         4

Figure 3: Sample contingency table, with s(C) = (15 + 20 + 21)/100 = 56%.

The pair-wise average precision is the average of these precision values over all intra-topic pairs.

To compensate for the effect of large topics (which increase the likelihood of chance intra-topic pairs), we modify the pair-wise average precision to create a new metric, which we call the kappa precision in reference to the Kappa statistic (Siegel and Castellan Jr., 1988; Carletta, 1996):

$$\mathrm{prec}_\kappa(p_i) = \frac{\mathrm{prec}(p_i) - \mathrm{chance}}{1 - \mathrm{chance}},$$

where chance = (# of intra-topic pairs)/(# of document pairs). The kappa average precision $\kappa$ is defined to be the average of the kappa precision over all intra-topic pairs, and is a linear function of the pair-wise average precision.
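The metric can be sketched in a few lines for the single-topic case. Several details below are assumptions made for the example rather than statements from the paper: each document carries exactly one topic label, tied cosines are ranked in an arbitrary (stable-sort) order, no document vector is zero, and at least one cross-topic pair exists so that chance is below 1.

```python
import numpy as np
from itertools import combinations

def kappa_average_precision(X, topics):
    """Kappa average precision for document vectors X (columns), one topic label per document."""
    n = X.shape[1]
    unit = X / np.linalg.norm(X, axis=0)
    pairs = list(combinations(range(n), 2))
    cosines = np.array([unit[:, i] @ unit[:, j] for i, j in pairs])
    intra = np.array([topics[i] == topics[j] for i, j in pairs])

    order = np.argsort(-cosines, kind="stable")     # rank pairs by decreasing cosine
    intra_sorted = intra[order]
    ranks = np.arange(1, len(pairs) + 1)
    prec = np.cumsum(intra_sorted) / ranks          # prec(p_j) for every rank j
    chance = intra.sum() / len(pairs)
    kappa_prec = (prec - chance) / (1.0 - chance)
    return float(kappa_prec[intra_sorted].mean())   # average over intra-topic pairs only

# Hypothetical representation in which same-topic documents coincide
# and cross-topic documents are orthogonal.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
kappa = kappa_average_precision(X, topics=[0, 0, 1, 1])
```

In this idealized example every intra-topic pair outranks every cross-topic pair, so the kappa average precision is 1.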

Clustering We also test how well the new subspaces represent document similarities by seeing whether document clustering improves when these new representations are used as input. To simplify the scoring, we consider only single-topic documents.

Let C be a cluster-topic contingency table such that C[i,j] is the number of documents in cluster i that are relevant to topic j, as in Slonim and Tishby (2000). We define $s(C) = \sum_{i,j} N_{i,j} / n$, where $N_{i,j} = C[i,j]$ if C[i,j] is the unique maximum in both its row and column, and $N_{i,j} = 0$ otherwise. Note that this (rather strict) measure only considers the most tightly coupled topic-cluster assignment, and decreases when either cluster purity or topic integrity declines (see Figure 3).
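The score s(C) can be computed directly from a contingency table. The sketch below applies it to the counts of Figure 3, treating empty cells as zero; the function name is illustrative.

```python
import numpy as np

def clustering_score(C):
    """s(C): fraction of documents lying in cells that are the unique maximum of both row and column."""
    C = np.asarray(C, dtype=float)
    total = 0.0
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            v = C[i, j]
            unique_row_max = v == C[i].max() and (C[i] == v).sum() == 1
            unique_col_max = v == C[:, j].max() and (C[:, j] == v).sum() == 1
            if unique_row_max and unique_col_max:
                total += v
    return total / C.sum()    # C.sum() is n because documents are single-topic

# Contingency table of Figure 3 (rows: clusters, columns: topics).
C = [[5, 10, 20, 0],
     [5, 10, 5, 0],
     [0, 0, 0, 21],
     [15, 5, 0, 0],
     [0, 0, 0, 4]]
score = clustering_score(C)
```

Only the cells holding 15, 20, and 21 documents qualify: the two 10s in the topic 2 column tie, so neither is a unique column maximum, which gives the 56% stated in the caption.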

To factor out the idiosyncrasies of particular clustering algorithms, we apply six standard clustering methods (single-link, complete-link, group average, and k-means with initial clusters generated by these three methods) to the document vectors in each proposed subspace, and record both the ceiling (highest) and floor (lowest) s(C) scores. While the ceiling performance is perhaps more intuitive, we observe that the floor performance also gives us important information about the quality of the representation being evaluated: if the floor is low, then there is at least one clustering algorithm for which the document subspace is not a good representation; otherwise, the representation is good for all six clustering algorithms.



5 Controlled distributions 

Our first suite of experiments studies the dependence of LSI and IRR on increasingly less uniform topic-document distributions. The results strongly support our theoretical analysis of LSI's sensitivity to non-uniformity.

5.1 Experimental setting 

To focus on distributional non-uniformity, we first chose two TREC topics, and then specified seven distribution types: (25, 25), (30, 20), (35, 15), (40, 10), (43, 7), (45, 5), and (46, 4), where $(n_1, n_2)$ indicates that $n_i$ of the documents are relevant to topic i. For each of these types, we generated ten sets of 50 TREC documents each, where each document was relevant to exactly one of the pre-selected topics. We also created five-topic^3 data sets in the same manner, using distribution types of the form (i, j, j, j, j) (which makes uniformity comparisons obvious).
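For a rough sense of how non-uniform these distribution types are, one can compare topic dominances directly. The sketch below assumes single-topic documents, so that each squared dominance equals the number of documents on that topic, and it approximates the ratio of the largest to the smallest dominance by the square root of the count ratio; the paper defines the ratio through singular values, so these figures are only indicative.

```python
import math

# Two-topic distribution types from the text: (n1, n2) documents per topic.
distribution_types = [(25, 25), (30, 20), (35, 15), (40, 10), (43, 7), (45, 5), (46, 4)]

def dominance_ratio(counts):
    """Ratio of the largest to the smallest topic dominance, sqrt(n_max / n_min)."""
    return math.sqrt(max(counts) / min(counts))

ratios = [dominance_ratio(c) for c in distribution_types]
```

Under this approximation the ratio grows from 1 for the uniform (25, 25) split to about 3.4 for the (46, 4) split.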

To create the term-document matrices, we extracted single-word stemmed terms using TALENT (Boguraev and Neff, 2000), removed stopwords, and then length-normalized the document vectors (so that term weights were frequency-based).

To implement AUTO-SCALE, we set $q = \alpha \cdot f(D) + \beta$, where $\alpha = 3.5$ and $\beta = 0$ for all our experiments. These values (which are necessary to determine the "units" of the scale factor) were empirically determined once and for all from observations on data disjoint from our test sets. This contrasts with training q for every new test set encountered, as in Ando (2000). Training is an expensive process, and we envision interactive applications such as organizing query results (a task we simulate in Section 6) in which what would serve as training data is not obvious. We thus view AUTO-SCALE as a practical alternative to the usual parameter training.

For simplicity, the dimensionality of LSI and IRR in our experiments was set to the number of topics.^4

5.2 Controlled-distribution results 

We first examine the kappa average precision results, shown in Figure 4. The x-axis represents the nonuniformity of the topic-document distribution, as measured by $\Delta_{\max}/\Delta_{\min}$. We see that when the topic-document distribution is relatively uniform, LSI's performance is higher than 90%. However, as the nonuniformity increases, the performance of LSI drops precipitously, in accordance with our theorems above.

Also, our interpretation of the scaling factor q as compensating for non-uniformity is borne out nicely. For highly uniform distributions, the performance difference between q = 0 (at which IRR = LSI), q = 2, and q = 4 is not great. At medium nonuniformity, q = 0 degrades, but q = 2 still does about the same as q = 4. But as the non-uniformity increases even more, we see that q = 2 is not large enough to compensate, and so declines in comparison to q = 4.

^3 Results for three- and four-topic document sets were similar and are therefore omitted.

^4 Our preliminary experiments with dimensionality training indicated that indeed this was often the best dimensionality for LSI and almost always the best for AUTO-SCALE-IRR.

Figure 4: Kappa average performance, two topics. Points are averages over ten document sets.

Figure 5: Automatically-determined scaling factor values: ten-set average and standard deviation.

Furthermore, we see that IRR with AUTO-SCALE (labelled 'IRR:q = auto') does extremely well across all levels of non-uniformity. Figure 5 shows that AUTO-SCALE indeed adjusts for more non-uniform distributions: the chosen scaling factor increases on average as the non-uniformity goes up.

Now, one might conjecture that instead of using AUTO-SCALE, it would suffice simply to choose a single very large value of q. Intuitively, though, this is problematic, since too high a scaling factor would tend to completely eliminate residuals. Furthermore, the q = 20 curve in Figure 4 disproves the conjecture: in the uniform case, selecting an overly large scaling factor hurts performance, driving it below the baseline VSM curve.

Figure 6: Floor and ceiling clustering results, two topics. Points are averages over ten document sets.

The two-topic floor and ceiling clustering results, shown in Figure 6, exhibit precisely the same types of behaviors as in the kappa average precision case.
The floor performances are especially interesting, as 
they show that AUTO-SCALE-IRR exhibits very good 
performance for all six of our rather wide variety of 
clustering algorithms. They also indicate that VSM 
is 'fragile' for uniform distributions, in that some- 
times it is a very poor representation for at least one 
of the clustering algorithms we employed. 

Finally, Figure 7 shows the results of the same evaluation experiments run on five-topic data. Again, the empirical results are completely in line with what we predicted, with AUTO-SCALE leading to strong performance over all metrics and all degrees of non-uniformity. Note that the gap between LSI and VSM decreases in comparison to the k = 2 case; this is due to the fact that at higher dimensionalities, the subspace produced by LSI gets closer to that of the original term-document matrix.

These results all strongly support our theoretical 
claims. 

6 Unrestricted distributions 

In this section, we experiment on the more realistic 
setting of document sets without distribution restric- 
tions. We expect that in practice, topic-document 
distributions will be fairly non-uniform, so that IRR 
should perform well in comparison to LSI. Figure 8
summarizes the evaluation settings.

We used 648 TREC documents, each relevant to exactly one of twenty TREC topics. To perform parameter training, we randomly divided these documents into two disjoint document pools. We then simulated input from an information retrieval application by generating 15 document sets from each pool, where each set consisted of those documents containing one of 15 arbitrarily chosen keywords; this yielded a total of 30 document sets. Document sets from one pool were used as parameter training data for the sets from the other pool, and vice versa. Performance results are averages over these 30 runs. The scaling factor for IRR was determined by AUTO-SCALE in all cases (again with the same constants $\alpha$ and $\beta$ as before). The term-document matrices were created in the same manner as in Section 5.1.

Figure 7: Ten-set averages of kappa average precision and floor and ceiling clustering results, five topics.

Figure 8: Evaluation settings for unrestricted distributions. Recall that k is the number of topics and $\ell$ is the dimensionality.

Figure 9: Thirty-set average absolute improvement in $\kappa$ over VSM (51.4%), unrestricted distributions.

6.1 Kappa average precision results

Recall that we consider two ways to choose the dimensionality of a document subspace. In the first case, the system knows k, the number of topics underlying the collection (in practice, this information could be user-supplied as a way to control topic granularity, or given by a set of predetermined classification labels), and sets the dimensionality to it. In the second case, k is considered unknown, so we simply train the dimensionality parameter using the residual ratio method described in Section 3.3.

From Figure 9, we see that IRR yields higher $\kappa$ than LSI and VSM for both dimensionality selection methods, and therefore does a better job at representing inter-document similarities. LSI performs relatively poorly on this task; indeed, using k dimensions in the LSI case leads to worse results than VSM.


6.2 Clustering results 

To derive floor and ceiling clustering performance results,
there are two parameters we need to specify:
the dimensionality of the subspace, and the number 
of clusters. 

If k, the number of topics, is available, then it is 
the natural choice for the number of clusters. Then, 
to choose the dimensionality in this case, one option 
is to also set it to k; Figure 10(a) shows the results.
We see that IRR has the best clustering performance 
overall. Note that LSI's ceiling is actually lower than 
VSM's. 

When k is given but we train the dimensionality 
via the residual ratio, IRR still provides a better sub- 
space for all the clustering algorithms we considered, 
both in terms of floor and ceiling performance
(Figure 10(b)). We observe that for this type of data,
training the dimensionality allows LSI to produce im- 
proved ceiling results. 

We now consider the case in which k is unknown. 
In this situation, we know of no alternative but to 
train the dimensionality on held-out data. As for the 
number of clusters, a reasonable default is to simply
set this value to the trained dimensionality. Of
course, this doesn't apply to VSM, since the dimensionality
is not a free parameter for it; instead, we set
the number of clusters to the average of the number
of topics in the training document sets.

Figure 10: Document clustering performances, unrestricted
distributions: averages over 30 runs.

Figure 10(c) shows the clustering results for the
unknown-k setting. LSI's ceiling degrades by 4.3%
compared with when the number of topics is given, 
while those of VSM and IRR show almost no change. 
Furthermore, IRR clearly outperforms the other 
methods. 

6.3 Discussion 

In our experiments, LSI did worse than, or essentially the
same as, VSM in 4 out of 8 combinations of practical
settings and metrics. In particular, when the dimen- 
sionality is chosen to be the number of topics, LSI 
performs relatively poorly. Dimensionality training 
improves LSI's kappa average precision scores, and 
also improves its clustering performance with respect 
to VSM as long as the correct number of clusters (i.e. 
the number of topics) is given. However, when the 
number of clusters is unknown, LSI's ceiling cluster- 
ing performance drops, again indicating that for LSI 
the dimensionality should not be tied to the number 
of clusters. 

In contrast, IRR consistently performs better than 
LSI and VSM for all our settings and metrics. In 
particular, IRR fares relatively well when the dimen- 
sionality is set to the number of topics as compared 
to when the dimensionality is actually trained. These 
results suggest that setting the dimensionality to the 
number of topics, when known, may be a practical al- 
ternative to dimensionality training. Furthermore, in 
clustering applications for which the number of top- 
ics is not known, we at least might be able to reduce 
the training effort by only searching for the dimen- 
sionality, setting the number of clusters to the same 
value. 



7 Conclusion 

To conclude, we review our three main results. First, 
we have provided a new theoretical analysis of LSI, 
showing a precise relationship between LSI's perfor- 
mance and the uniformity of the underlying topic- 
document distribution. Second, we have used our 
framework to extend Ando's (2000) IRR algorithm 
by giving a novel and effective method for determin- 
ing the requisite scaling factor. Third, we have shown 
that IRR, together with our parameter determination 
method, provides very good performance in compari- 
son to LSI over a variety of document-topic distribu- 
tions and applications-oriented metrics. 
A Appendix

We use the following perturbation result on singular
values, which is a slight rewriting of Corollary 8.6.2
of Golub and Van Loan (1996).



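Theorem A.1 Let $E = X_1 - X_2$, where $X_1, X_2 \in \mathbb{R}^{r \times s}$
and $r \geq s$, and for $1 \leq i \leq s$, let $\sigma_i^{(1)}$ and
$\sigma_i^{(2)}$ denote the $i$th singular values of $X_1$ and $X_2$,
respectively. Then,

$$\left| \sigma_i^{(1)} - \sigma_i^{(2)} \right| \leq \|E\|_2 \leq \|E\|_F .$$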
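
The perturbation bound can be checked numerically. The sketch below is only an illustration under assumed inputs (two random matrices that differ by a small random perturbation; all variable names are hypothetical): it compares the largest shift in any singular value against the spectral and Frobenius norms of the difference.

```python
import numpy as np

rng = np.random.default_rng(1)
r, s = 8, 5  # r >= s, as required by the theorem
X1 = rng.standard_normal((r, s))
X2 = X1 + 0.1 * rng.standard_normal((r, s))  # assumed small perturbation
E = X1 - X2

# np.linalg.svd returns singular values sorted in descending order.
sv1 = np.linalg.svd(X1, compute_uv=False)
sv2 = np.linalg.svd(X2, compute_uv=False)

max_shift = np.max(np.abs(sv1 - sv2))     # largest |sigma_i(X1) - sigma_i(X2)|
spectral = np.linalg.norm(E, ord=2)       # ||E||_2
frobenius = np.linalg.norm(E, ord="fro")  # ||E||_F
```

For any choice of inputs, `max_shift` should not exceed `spectral`, which in turn should not exceed `frobenius`.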



Notational conventions If we refer to a singular
value $\sigma_j$ of a matrix $Z$ with rank less than $j$, it is
understood that $\sigma_j = 0$. (This corresponds to the "zero-padded"
version of the SVD that is also commonly
used; we presented the non-padded version above for
conceptual clarity.)

Throughout, denotes the ith singular value of 
Popt(D). 



A.1 Proof of Theorem 2.1



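Let $\rho_1, \ldots, \rho_n$ be the singular values of the true topic-based
similarities matrix $S$. For convenience, fix a
particular $i \in \{1, \ldots, n\}$, and write $\sigma_i(Z)$ for the $i$th
singular value of a matrix $Z$.

Since $\mathrm{diff}_{S,D}(\mathcal{X}_{opt}) = S - P_{opt}(D)^T P_{opt}(D)$, and the
singular values of $P_{opt}(D)^T P_{opt}(D)$ are the squares of those
of $P_{opt}(D)$, by Theorem A.1 we have
$\sigma_i(P_{opt}(D))^2 \in \rho_i \pm \epsilon_{opt}$.

Next, define the matrix $S' \in \mathbb{R}^{n \times n}$ by

$$S'[t_1, t_2] = \sum_{d \in C} \mathrm{rel}(t_1, d)\, \mathrm{rel}(t_2, d) ,$$

where we define $\mathrm{rel}(t, d) = 0$ for $t > k$ and $d \in C$.
One can verify that the singular values of $S'$ are the
same as the singular values of $S$: $\rho_1, \ldots, \rho_n$.

Now, consider the matrix
$E = S' - \mathrm{diag}(\Delta_1^2, \ldots, \Delta_n^2)$. We note that
the singular values of $\mathrm{diag}(\Delta_1^2, \ldots, \Delta_n^2)$ are
$\Delta_1^2, \ldots, \Delta_n^2$. Furthermore, observe that
$\|E\|_F = \mu(C)$, the topic mingling in the collection.
Therefore, by again applying Theorem A.1 we find that
$\rho_i \in \Delta_i^2 \pm \mu(C)$.

Combining our partial results yields the desired result:

$$\sigma_i(P_{opt}(D))^2 \in \Delta_i^2 \pm \left( \epsilon_{opt} + \mu(C) \right) .$$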
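
Two facts used in this proof can be checked on toy data: that the zero-padded topic-topic matrix S' has the same singular values as the document-document matrix S, and that the singular values of S' stay within the Frobenius norm of its off-diagonal part from its sorted diagonal entries. The sketch below is a minimal illustration only; the relevance values are random, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 10              # assumption: 3 topics, 10 documents
rel = rng.random((k, n))  # rel[t, d]; illustrative values only
rel_padded = np.vstack([rel, np.zeros((n - k, n))])  # rel(t, d) = 0 for t > k

S = rel.T @ rel                      # document-document similarities (n x n)
S_prime = rel_padded @ rel_padded.T  # topic-topic matrix S' (n x n)

rho = np.linalg.svd(S, compute_uv=False)
rho_prime = np.linalg.svd(S_prime, compute_uv=False)

# Split S' into its diagonal and its off-diagonal part E.
diag_part = np.diag(np.diag(S_prime))
E = S_prime - diag_part
off_diag_norm = np.linalg.norm(E, ord="fro")

# Singular values of a nonnegative diagonal matrix are its sorted entries.
diag_sorted = np.sort(np.diag(S_prime))[::-1]
max_deviation = np.max(np.abs(rho - diag_sorted))
```

When the topics barely overlap, the off-diagonal norm is small and the singular values of S are pinned close to the diagonal entries of S'.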



A.2 Invariant subspace tangent theorem



The proof of Theorem 2.2 is based on the following
simplification of the Davis-Kahan tangent theorem
(Davis and Kahan, 1970) (see also Theorem 3.10
of Stewart and Sun (1990)).



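Theorem A.2 Let $A \in \mathbb{R}^{m \times m}$ be a symmetric matrix,
and let $[X \; Y]$ be an orthogonal matrix, with
$X \in \mathbb{R}^{m \times p}$, so that range($X$) forms an invariant
subspace of $A$ (i.e. $x \in$ range($X$) implies $Ax \in$
range($X$)). For any matrix $\hat{X} \in \mathbb{R}^{m \times p}$ with orthonormal
columns, we define the residual matrix $R$ of $\hat{X}$
as

$$R = A\hat{X} - (\hat{X}\hat{X}^T) A \hat{X} .$$

Suppose the eigenvalues of $\hat{X}^T A \hat{X}$ lie in the range
$[\alpha, \beta]$ and that there exists $\delta > 0$ such that the
eigenvalues of $Y^T A Y$ either all lie in the interval
$(-\infty, \alpha - \delta]$ or are all in $[\beta + \delta, \infty)$. Then,

$$\left\| \tan \Theta(\mathrm{range}(X), \mathrm{range}(\hat{X})) \right\|_2 \leq \frac{\|R\|_2}{\delta} .$$

A.3 Proof sketch for Theorem 2.2

Here, we outline the proof (given in full in Ando
(2001)). The main idea is to apply Theorem A.2
by choosing $X$ and $\hat{X}$ so that range($X$) = $\mathcal{X}_{LSI}$ and
range($\hat{X}$) = $\mathcal{X}_{opt}$, and setting $A = DD^T$, for which
the LSI subspace is invariant.

Let $\hat{D} = P_{opt}(D)$, and define $\tilde{D} = D - \hat{D}$; note that
$\hat{D}^T \tilde{D} = \tilde{D}^T \hat{D} = 0$. Let $\sigma_i$ denote the $i$th singular
value of $D$.

First, consider the largest singular value of $\tilde{D}^T \tilde{D}$,
which we will denote by $\tilde{\epsilon}_{VSM}$. It can be shown
that $\tilde{\epsilon}_{VSM} \in \epsilon_{VSM} \pm \epsilon_{opt}$, thus justifying our choice
of notation. (First show that
$\mathrm{diff}_{S,D}(\mathcal{X}_{opt}) = \mathrm{diff}_{S,D}(\mathcal{X}_{VSM}) + \tilde{D}^T \tilde{D}$, and then apply Theorem
A.1 to this equation and observe that the largest singular
value of $\mathrm{diff}_{S,D}(\mathcal{X}_{VSM})$ is $\epsilon_{VSM}$.)

Then, it can be shown that $\sigma_{h+1} \leq \sqrt{\tilde{\epsilon}_{VSM}}$: apply
Theorem A.1 to the equation $\tilde{D}^T \tilde{D} = D^T D - \hat{D}^T \hat{D}$
and then note that the $(h+1)$th singular value of $\hat{D}$ is 0
because of the dimensionality of $\mathcal{X}_{opt}$.

Now, to apply Theorem A.2, choose the columns
of $X$ to be the first $h$ left singular vectors (in order)
of $D$, and choose the columns of $\hat{X}$ to be all $h$ left
singular vectors (in order) of $\hat{D}$. We set the columns
of $Y$ to the rest of the $m - h$ left singular vectors (in
order) of the "zero-padded" version of the SVD of $D$.

Some linear algebra reveals that the eigenvalues of
$Y^T A Y$ are no greater than $\sigma_{h+1}^2$, and the eigenvalues
of $\hat{X}^T A \hat{X}$ are no smaller than $\Delta_{\min}^2$. Therefore, we
set $\delta = \Delta_{\min}^2 - \sigma_{h+1}^2$. Note that
$\delta \geq \Delta_{\min}^2 - \tilde{\epsilon}_{VSM}$
by above, and so is positive by assumption. Hence,
Theorem A.2 applies.

Finally, it can be shown that $\|R\|_2 \leq \Delta_{\max} \sqrt{\tilde{\epsilon}_{VSM}}$,
which yields the desired result:

$$\left\| \tan \Theta(\mathcal{X}_{LSI}, \mathcal{X}_{opt}) \right\|_2 \leq \frac{\|R\|_2}{\delta} \leq \frac{\dfrac{\Delta_{\max}}{\Delta_{\min}} \cdot \dfrac{\sqrt{\tilde{\epsilon}_{VSM}}}{\Delta_{\min}}}{1 - \left( \dfrac{\sqrt{\tilde{\epsilon}_{VSM}}}{\Delta_{\min}} \right)^2} .$$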
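
The tangent bound of Theorem A.2 can be sanity-checked numerically. The following is a minimal sketch under assumed inputs, not part of the proof: it builds a symmetric matrix with a well-separated leading invariant subspace, perturbs that subspace slightly to obtain a second orthonormal basis, and compares the tangent of the largest principal angle with the residual norm divided by the gap. The eigenvalues, the perturbation size, and all variable names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 8, 3

# Symmetric A whose leading eigenvectors span the invariant subspace range(X).
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
eigenvalues = np.array([10.0, 9.0, 8.0, 1.0, 0.8, 0.5, 0.2, 0.1])
A = Q @ np.diag(eigenvalues) @ Q.T
X, Y = Q[:, :p], Q[:, p:]

# X_hat: orthonormal basis of a slightly perturbed copy of range(X).
X_hat, _ = np.linalg.qr(X + 0.05 * rng.standard_normal((m, p)))

# Residual matrix R = A X_hat - X_hat (X_hat^T A X_hat).
R = A @ X_hat - X_hat @ (X_hat.T @ A @ X_hat)

# Eigenvalues of X_hat^T A X_hat lie in [alpha, beta]; those of Y^T A Y lie below.
ritz = np.linalg.eigvalsh(X_hat.T @ A @ X_hat)  # ascending order
alpha = ritz[0]
delta = alpha - np.linalg.eigvalsh(Y.T @ A @ Y)[-1]

# Tangent of the largest principal angle between range(X) and range(X_hat).
cosines = np.clip(np.linalg.svd(X.T @ X_hat, compute_uv=False), 0.0, 1.0)
tan_theta = np.sqrt(1.0 - cosines[-1] ** 2) / cosines[-1]
bound = np.linalg.norm(R, ord=2) / delta
```

With a positive gap `delta`, `tan_theta` should not exceed `bound`; shrinking the gap (for example, by moving the fourth eigenvalue closer to 8) loosens the bound accordingly.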



